SSOR Preconditioning of Improved Actions * 

N. Eicker^a, W. Bietenholz^a, A. Frommer^b, H. Hoeber^a, Th. Lippert^c, K. Schilling^{a,c}
^a HLRZ, c/o Research Center Jülich, D-52425 Jülich, Germany

^b Department of Mathematics, University of Wuppertal, D-42097 Wuppertal, Germany
^c Department of Physics, University of Wuppertal, D-42097 Wuppertal, Germany

We generalize local-lexicographic SSOR preconditioning to the
Sheikholeslami-Wohlert improved Wilson fermion action and the truncated
perfect free fermion action. In our test implementation we achieve
performance gains comparable to those known from SSOR preconditioning of
the standard Wilson fermion action.



1. INTRODUCTION 

The standard Wilson fermion action of lattice QCD leads to
discretization errors of $O(a)$ in the lattice spacing $a$, requiring
prohibitively fine lattice resolutions in the approach to the chiral and
continuum limits [1]. The present trend to tackle this problem goes in
two directions: (i) One approach is based on Symanzik's on-shell
improvement program, where irrelevant $O(a)$ counter terms are added to
both the lattice action (Sheikholeslami-Wohlert-Wilson action, SWA) and
composite operators [2]. The hope is to reach the continuum limit for a
specific observable, $\mathcal{O}(a) = \mathcal{O}_{\rm cont} + c_2 a^2 + \dots$,
without $O(a)$ contamination. (ii) Another promising ansatz is based on
so-called perfect actions, which are located on renormalized
trajectories intersecting the critical surface in a fixed point of a
renormalization group transformation [3]. Perfect actions are in
principle free of cut-off effects. However, they can only be realized
approximately, as truncated perfect actions (TPA).

Simulations of dynamical fermions within these schemes face the
compute-intensive solution of the fermionic linear system $Mx = \phi$,
well known from traditional actions. In the last three years, a
considerable acceleration of the inversion of the standard Wilson
fermion matrix has been achieved by the introduction of the BiCGStab
algorithm [4] and novel parallel local-lexicographic (ll) SSOR
preconditioning techniques [5]. Obviously, these efforts should be
combined, i.e. ll-SSOR should be generalized to SWAs and TPAs in order
to gain their full pay-off^1.

In general, both SWA and TPA can be written 
in the form 



$$ M = A + B + C \tag{1} $$



where $A$ represents diagonal blocks (containing $12 \times 12$
sub-blocks), $B$ is a nearest-neighbor hopping term, and $C$ contains
next-to-nearest-neighbor couplings. Usually next-next-nearest-neighbor
couplings are truncated.

Our key observation is that one can include into the ll-SSOR process
(i) the internal degrees of freedom of the block diagonal term $A$ as
arising in SWA and (ii) next-to-nearest-neighbor terms $C$ as present in
TPA.

2. PRECONDITIONING SWA 

Preconditioning amounts to the replacement of $M$, $x$ and $\phi$ by
preconditioned quantities $\tilde M$, $\tilde x$ and $\tilde\phi$. The
aim is to transform the matrix such that its spectrum becomes narrower,
which increases the efficiency of the inversion. The matrix-vector
multiplication is replaced by



$$ v_i = M p_i \;\longrightarrow\; \text{solve } P z_i = p_i \tag{2} $$



*Talk presented by N. Eicker. 



^1 In Ref. [7], odd-even preconditioning has been applied to the
Sheikholeslami-Wohlert-Wilson action.
$P$ represents the preconditioning matrix. It can be decomposed into a
product of three regular matrices, $P = RST$. This allows us to apply
the 'Eisenstat trick' [6], using the identity $M = R + T - K$ with a
fourth regular matrix $K$.

Let us recall the essentials of ll-SSOR [5]: The matrix $M$ is
decomposed into a (block-)diagonal part $D$ and two strictly upper and
lower triangular parts $U$ and $L$, $M = D - L - U$. This can be
achieved by local lexicographic ordering, described in [5]. The choices
for $R$, $S$ and $T$ are

$$ R = \frac{1}{\omega} D - L, \quad S = \Bigl(\frac{1}{\omega} D\Bigr)^{-1} \quad\text{and}\quad T = \frac{1}{\omega} D - U. $$

Here $\omega$ is an over-relaxation parameter to be chosen
appropriately.
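
As a short consistency check (our own worked step, not taken from the original text), the fourth matrix $K$ of the Eisenstat identity can be read off directly, assuming the choices above are $R = D/\omega - L$ and $T = D/\omega - U$ with $M = D - L - U$:

```latex
\begin{align*}
R + T &= \tfrac{1}{\omega} D - L + \tfrac{1}{\omega} D - U
       = \tfrac{2}{\omega} D - L - U, \\
M &= D - L - U = R + T - K,
\qquad
K = \Bigl(\tfrac{2}{\omega} - 1\Bigr) D = \tfrac{2 - \omega}{\omega}\, D .
\end{align*}
```

Under this reading, $K$ is just a rescaled copy of $D$ (and $K = D$ at $\omega = 1$), so it adds no storage beyond what $D$ already requires.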

These special choices of $R$, $S$ and $T$ simplify the task of solving
the linear equation in (2). They lead to the following replacement of
the matrix-vector multiplication:



$$ \begin{array}{l} \text{multiply by } D \\ \text{solve backward} \\ \text{solve forward} \end{array} \tag{3} $$
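
The three steps can be illustrated with a small dense toy problem. The sketch below is not the authors' implementation: it assumes a true (scalar) diagonal $D$, takes the preconditioned operator to be $R^{-1} M T^{-1} S^{-1}$, and uses $K = (2/\omega - 1) D$, which follows from $M = R + T - K$. All function names are hypothetical. It checks that the Eisenstat form, which never multiplies by $M$, agrees with the explicit product.

```python
# Toy dense check of the Eisenstat trick (pure Python, true diagonal D assumed).

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def forward_solve(R, b):
    """Solve R z = b for a lower-triangular R."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(R[i][j] * z[j] for j in range(i))) / R[i][i]
    return z

def backward_solve(T, b):
    """Solve T y = b for an upper-triangular T."""
    n = len(b)
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (b[i] - sum(T[i][j] * y[j] for j in range(i + 1, n))) / T[i][i]
    return y

def ssor_factors(M, omega):
    """Return diag(D)/omega, R = D/omega - L and T = D/omega - U for M = D - L - U."""
    n = len(M)
    d = [M[i][i] / omega for i in range(n)]
    # -L and -U are simply the strictly lower/upper parts of M.
    R = [[d[i] if i == j else (M[i][j] if i > j else 0.0) for j in range(n)]
         for i in range(n)]
    T = [[d[i] if i == j else (M[i][j] if i < j else 0.0) for j in range(n)]
         for i in range(n)]
    return d, R, T

def eisenstat_matvec(M, omega, p):
    """Apply R^{-1} M T^{-1} S^{-1} to p without ever multiplying by M."""
    d, R, T = ssor_factors(M, omega)
    t = [di * pi for di, pi in zip(d, p)]      # multiply by D: t = S^{-1} p
    y = backward_solve(T, t)                   # solve backward: T y = t
    # K = (2/omega - 1) D = (2 - omega) * (D/omega)
    w = [ti - (2.0 - omega) * di * yi for ti, di, yi in zip(t, d, y)]
    z = forward_solve(R, w)                    # solve forward: R z = t - K y
    return [yi + zi for yi, zi in zip(y, z)]

def reference_matvec(M, omega, p):
    """Same operator, but with an explicit multiplication by M."""
    d, R, T = ssor_factors(M, omega)
    y = backward_solve(T, [di * pi for di, pi in zip(d, p)])
    return forward_solve(R, matvec(M, y))

# Arbitrary non-symmetric test matrix and vector (illustrative values only).
M = [[4.0, -1.0, 0.5, 0.0],
     [-2.0, 5.0, -1.0, 0.3],
     [0.0, -1.5, 3.0, -1.0],
     [0.7, 0.0, -1.0, 6.0]]
p = [1.0, -2.0, 0.5, 3.0]
v_fast = eisenstat_matvec(M, 1.5, p)
v_ref = reference_matvec(M, 1.5, p)
max_diff = max(abs(a - b) for a, b in zip(v_fast, v_ref))
```

The two results should agree to rounding error. In a lattice code the dense loops would be replaced by sweeps over sites in the chosen ordering, but the algebra is the same.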



For standard Wilson fermions, the block-diagonal term is given by
$A \propto 1$, which implies the natural choice $D \propto 1$ in the
SSOR scheme. Therefore, in (3) the multiplication by $D$ and the
multiplication by $D^{-1}$ in the forward/backward solve are readily
carried out. In the multiplication with a diagonal block, the 12
color-spin elements in the vector $x$ are decoupled and can be treated
simultaneously.

The situation changes if $A$ is not strictly diagonal. We have the
freedom to choose the splitting of $A$ into the diagonal term $D$ and
the upper and lower terms $U$ and $L$. As efficient implementations
require storing $D^{-1}$, we can thus control the memory overhead.
However, depending on the choice of $D$, the elements of $x$ are
intermixed in the multiplication with a diagonal block. Therefore they
can only be treated simultaneously if we choose the diagonal part as
$D = A$, the choice with the largest memory overhead. For any other
choice, SSOR subprocesses on the diagonal blocks have to be introduced.
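
The freedom in choosing $D$ can be made concrete with a toy splitting routine. The sketch below is only an illustration under our own assumptions: $D$ is built from dense $b \times b$ blocks on the diagonal, its inverse is stored densely block by block, and the helper names `split_blocks` and `stored_entries` are hypothetical. It verifies that $M = D - L - U$ holds for every block size and counts the entries of $D^{-1}$ that would have to be stored per $12 \times 12$ site block.

```python
def split_blocks(M, b):
    """Split M = D - L - U with D built from b x b diagonal blocks (b = 1: true diagonal)."""
    n = len(M)
    D = [[0.0] * n for _ in range(n)]
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i // b == j // b:
                D[i][j] = M[i][j]       # inside a diagonal block
            elif i > j:
                L[i][j] = -M[i][j]      # strictly below the blocks
            else:
                U[i][j] = -M[i][j]      # strictly above the blocks
    return D, L, U

def stored_entries(n, b):
    """Entries needed to keep D^{-1} for an n x n site block: n/b dense b x b blocks."""
    return (n // b) * b * b

# Toy 4 x 4 "site block" with two coupled 2 x 2 sub-blocks (illustrative values).
M = [[2.0, 0.3, 0.1, 0.0],
     [0.3, 2.0, 0.0, 0.1],
     [0.1, 0.0, 2.0, -0.3],
     [0.0, 0.1, -0.3, 2.0]]
for b in (1, 2, 4):
    D, L, U = split_blocks(M, b)
    rebuilt = [[D[i][j] - L[i][j] - U[i][j] for j in range(4)] for i in range(4)]
    assert rebuilt == M                 # every splitting reproduces M

# Storage per 12 x 12 site block for three block sizes.
counts = {b: stored_entries(12, b) for b in (1, 3, 12)}
```

With these assumptions the count grows from 12 entries (true diagonal) over 36 (3 x 3 blocks) to 144 (the full block, $D = A$). The smaller choices leave couplings inside the site block in $L$ and $U$, which is where the additional SSOR subprocesses come from.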

As a test we have implemented the ll-SSOR preconditioning scheme within
BiCGStab for SWA. The diagonal part of the related quark matrix
contains four complex $3 \times 3$ matrices $F_i$.
This structure reduces the storage requirements by a factor of 4 and is
well suited to a QCD-optimized machine.

We tested the inverter in a $16^4$ pure gauge background at
$\beta = 6.0$ for two choices of $D$: (i) the true diagonal (true) with
twelve $1 \times 1$ blocks and (ii) the $1 \pm F_i$ blocks (block) as
shown in (4). The $c_{SW}$ parameter was chosen as 1.0 and 1.6. We
tested the algorithm for different values of $\kappa$ on 4 field
configurations. The tests were done on the 32-node APE100/Quadrics Q4
in Wuppertal. As a reference, we compare to unpreconditioned BiCGStab
rescaled by a factor of 2 to mimic odd-even preconditioning.
Figure 1. Number of iterations at $c_{SW} = 1.0$.



Fig. 1 shows iteration numbers at $c_{SW} = 1.0$ for both
implementations. BiCGStab/2 represents the estimate for the odd-even
inverter; $\omega = 1.0$ stands for the case without over-relaxation,
and the optimal value is $\omega = 1.5$. The gain is about a factor of 2
against BiCGStab/2 at $\omega = 1$; over-relaxation yields another 10 to
20% gain. The difference between the two choices of $D$ is not
significant. The results shown in Fig. 2 for $c_{SW} = 1.6$ are
qualitatively identical to the $c_{SW} = 1.0$ case.



Figure 2. Number of iterations at $c_{SW} = 1.6$. [Plot: number of
iterations (vertical axis, 20 to 140) versus $\kappa$ (horizontal axis,
0.126 to 0.138); legend entries: $\omega = 1.0$ and $\omega = 1.5$ for
each choice of $D$, and BiCGStab/2.]

3. PRECONDITIONING TPA 

Next we consider a perfect free lattice fermion action for arbitrary
mass [8]. As the couplings decay exponentially, a practical truncation
scheme confines the couplings to a unit hypercube.

The matrix for this "hypercube fermion" (HF), albeit with $A \propto 1$,
seems considerably more complicated than the Wilson fermion matrix, due
to contributions of type $C$ and beyond. But ll-SSOR preconditioning and
the Eisenstat trick remain applicable. We will present a detailed
treatment elsewhere [9].

At this stage we discuss the effect of preconditioning by recourse to
the multi-color approach, the extension of the "red-black" scheme. This
leads us to $2^d$ non-interacting sub-lattices. We obtain many
off-diagonal blocks that are fortunately largely suppressed. Denoting
the maximal magnitude of the elements in $L$ and $U$ as
$O(\varepsilon)$, we apply the analog to the odd-even transformation
and get

$$ M' = 1 - \Bigl(\sum_{i \ge 1} L^i\Bigr)\Bigl(\sum_{j \ge 1} U^j\Bigr) = 1 - LU - O(\varepsilon^3). $$



We expect the spectrum to be much closer to 1, since the eigenvalues of
$M'$ are all $1 - O(\varepsilon^2)$ (for $M$ they are
$1 - O(\varepsilon)$).
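
The structure of $M'$ can be made explicit with a short worked derivation. This is our own sketch, assuming that "the analog to the odd-even transformation" means $M' = (1-L)^{-1} M (1-U)^{-1}$ with $M = 1 - L - U$ (i.e. $A \propto 1$ normalized to the unit matrix) and that $L$ and $U$ are strictly block-triangular, hence nilpotent, with elements of $O(\varepsilon)$:

```latex
\begin{align*}
M' &= (1-L)^{-1}\,(1 - L - U)\,(1-U)^{-1} \\
   &= (1-L)^{-1}\,\bigl[(1-L)(1-U) - LU\bigr]\,(1-U)^{-1} \\
   &= 1 - (1-L)^{-1} L \; U\,(1-U)^{-1} \\
   &= 1 - \Bigl(\sum_{i \ge 1} L^i\Bigr)\Bigl(\sum_{j \ge 1} U^j\Bigr)
    = 1 - LU - O(\varepsilon^3).
\end{align*}
```

The sums are finite because $L$ and $U$ are nilpotent. For two colors, $L^2 = U^2 = 0$ and the expression reduces to $M' = 1 - LU$, the familiar odd-even result; in both cases the deviation from the unit matrix starts at $O(\varepsilon^2)$ rather than $O(\varepsilon)$.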

Fig. 3 shows that the parameter $\varepsilon$ obtained is smaller for
the HF than for the Wilson fermion, since the lattice derivative is
somehow "smeared over the hypercube". Thus we expect multi-color (and
also SSOR) preconditioning to work very well.
Figure 3. The off-diagonal magnitude $\varepsilon$ for the Wilson
fermion ($r = 1$) and for the perfect truncated fermion, as a function
of the mass.



4. CONCLUSIONS 

We demonstrated that the application of the ll-SSOR preconditioning
scheme leads to the most efficient preconditioning known for improved
actions. Our method saves a large factor in memory compared to odd-even
preconditioning.